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Abstract 

The true costs of high performance computing are currently dominated by software. Addressing these costs requires 
shifting to high productivity languages such as Matlab. MatlabMPI is a Matlab implementation of the Message Passing 
Interface (MPI) standard and allows any Matlab program to exploit multiple processors. MatlabMPI currently implements 
the basic six functions that are the core of the MPI point-to-point communications standard. The key technical innovation 
of MatlabMPI is that it implements the widely used MPI "look and feel" on top of standard Matlab file I/O, resulting 
in an extremely compact ('--'250 lines of code) and "pure" implementation which runs anywhere Matlab runs, and on any 
heterogeneous combination of computers. The performance has been tested on both shared and distributed memory parallel 
computers (e.g. Sun, SGI, HP, IBM and Linux). MatlabMPI can match the bandwidth of C based MPI at large message 
sizes. A test image filtering application using MatlabMPI achieved a speedup of ^-^300 using 304 CPUs and '--^15% of the 
theoretical peak (450 Gigaflops) on an IBM SP2 at the Maui High Performance Computing Center In addition, this entire 
parallel benchmark application was implemented in 70 software-lines-of-code (SLOC) yielding 0.85 Gigaflop/SLOC or 4.4 
CPUs/SLOC, which are the highest values of these software price performance metrics ever achieved for any application. 
The MatlabMPI software will be made available for download. 
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^ Matlab ^ is the dominant programming language for implementing numerical computations and is widely used for algo- 
^ rithm development, simulation, data reduction, testing and system evaluation. The popularity of Matlab is driven by the high 

■ ■ ■ f)roductivity that is achieved by users because one line of Matlab code can typically replace ten lines of C or Fortran code. 
Many Matlab programs can benefit from faster execution on a parallel computer, but achieving this goal has been a signifi- 
cant challenge. There have been many previous attempts to provide an efficient mechanism for running Matlab programs on 
parallel computers [|H|, |, 0, g |, [l|, |ll||T2Hl|, |l|]. These efforts of have faced numerous challenges and none have 



received widespread acceptance. 

In the world of parallel computing the Message Passing Interface (MPI) is the de facto standard for implementing 
programs on multiple processors. MPI defines C and Fortran language functions for doing point-to-point communication 
in a parallel program. MPI has proven to be an effective model for implementing parallel programs and is used by many 
of the worlds' most demanding applications (weather modeling, weapons simulation, aircraft design, and signal processing 
simulation). 

MatlabMPI consists of a set of Matlab scripts that implements a subset of MPI and allows any Matlab program to be 
run on a parallel computer The key innovation of MatlabMPI is that it implements the widely used MPI "look and feel" on 
top of standard Matlab file I/O, resulting in a "pure" Matlab implementation that is exceedingly small (~250 lines of code). 
Thus, MatlabMPI will run on any combination of computers that Matlab supports. 

The next section describes the implementation and functionality provided by MatlabMPI. Section three presents results 
on the bandwidth performance of the library from several parallel computers. Section four uses an image processing appli- 
cation to show the scaling performance and the very high software price performance that can be achieved using MatlabMPI. 
Section five presents our conclusions and plans for future work. 

*This work is sponsored by the High Performance Computing Modernization Office, under Air Force Contract F19628-00-C-0002. Opinions, 
intemretatinns. cnnchisions and recommendations are those of the author and are not necessarilv endorsed hv the Denartment of Defense. 



2 Implementation 



The central innovation of MatlabMPI is the use of Matlab's built in file I/O capabilities, which allow any Matlab variable 
(matrices, arrays, structures, ...) to be written and read by Matlab running on any machine, and eliminates the need to write 
complicated buffer packing and unpacking routines. The approach used in MatlabMPI is extremely simple and is illustrated 
in Figure [I[ The sender saves a Matlab variable to a data file and when the file is complete the sender touches a lock file. 
The receiver continuously checks for the existence of the lock file; when it is exists the receiver reads in the data file and then 
deletes both the data file and the lock file. An example of a basic send and receive Matlab program is shown below 



MPI_Init; 

comm = MPI_COMM_WORLD; 
comm_size = MPI_Comm_size (comm) ; 
my_rank = MPI_Comm_rank (comm) ; 
source = 0; 
dest = 1; 
tag = 1; 

if {comm_size == 2) 

if {my_rank == source) 
data = 1:10; 

MPI_Send (data, dest, tag, comm) ; 
end 

if {my_rank == dest) 

data=MPI_Recv (source, tag, comm) 
end 
end 

MPI_Finalize; 
exit; 



Initialize MPI. 

Create communicator. 

Get size. 

Get rank . 

Set source. 

Set destination. 

Set message tag. 

Check size. 

If source. 

Create data. 

Send data. 

If destination. 
Receive data. 



Finalize Matlab MPI 
Exit Matlab 



The structure of the above program is very similar to MPI programs written in C or fortran, but with the convenience of a 
high level language. The first part of the program sets up the MPI world; the second part differentiates the program according 
to rank and executes the communication; the third part closes down the MPI world and exits Matlab. If the above program 
were written in a matlab file called SendReceive .m, it would be executed by calling the following command from the 
Matlab prompt: 

eval ( MPI_Run (' SendReceive' , 2 , machines ) ); 
Where the machines argument can be any of the following forms: 
machines = {}; Run on the local host. 

machines = {'machinel' 'machine2'}; Run on a multi processors. 

machines = {'machinel :dirl' 'machine2 :dir2' }; Run on a multi processors and communi- 
cate using dirl and dir2, which must be visible to both machines. 

The MPI_Run command launches a Matlab script on the specified machines with output redirected to a file. If the rank=0 
process is being run on the local machine, MPI_Run returns a string containing the commands to initialize MatlabMPI, 
which allows MatlabMPI to be invoked deep inside of a Matlab program in a manner similar to fork-join model employed 
in OpenMP. 

The SendReceive example illustrates the basic six MPI functions (plus MPI_Run) that have been implemented in 
MatlabMPI 

MP I_Run Runs a matlab script in parallel. 
MP I _I nit Initializes MPI. 

MP I _C omm_s i z e Gets the number of processors in a communicator. 
MPI_Comm_rank Gets the rank of current processor within a communicator. 



MPI_Send Sends a message to a processor (non-blocking). 
MPI_Recv Receives message from a processor (blocking). 
MPI_Finalize Finalizes MPI. 

For convenience, three additional MPI functions have also been implemented along with two utility functions that provide 
important MatlabMPI functionality, but are outside the MPI specification 

MP I .Abort Function to kill all matiab jobs started by MatlabMPI. 

MPI^cast Broadcast a message (blocking). 

MP U> robe Returns a list of all incoming messages. 

MatMPI_Savejniessages MatlabMPI function to prevent messages from being deleted (useful for debug- 
ging). 

MatMPI_Delete_all MatlabMPI function to delete all files created by MatlabMPI. 

MatlabMPI handles errors the same as Matiab, however running hundreds of copies does bring up some additional issues. 
If an error is encountered and the Matiab script has an "exit" statement then all the Matiab processes will die gracefully. If a 
Matiab job is waiting for a message that will never arrives than it needs to be killed with the MP I_Abort command. In this 
situation, MatlabMPI can leave files which need to be deleted with the MatMPI_Delete_all command. 

On shared memory systems, MatlabMPI only requires a single Matiab license since any user is allowed to launch many 
Matiab sessions. On a distributed memory system, MatlabMPI requires one Matiab license per machine. Because MatlabMPI 
uses file I/O for communication, there must also be a directory that is visible to every machine (this is usually also required 
in order to install Matiab). This directory defaults to the directory that the program is launched from. 

3 Bandwidth 

The vast majority of potential Matiab applications are "embarrassingly" parallel and require minimal performance out of 
MatlabMPI. These applications exploit coarse grain parallelism and communicate rarely (if at all). Never-the-less, measuring 
the communication performance is useful for determining which applications are most suitable for MatlabMPI. 

MatlabMPI has been run on several Unix platforms. It has been benchmarked and compared to the performance of C MPI 
on the SGI Origin2000 housed at Boston University. These results indicate that for large messages ('^1 MByte) MatlabMPI 
is able to match the performance of C MPI (see Figure |2|). For smaller messages, MatlabMPI is dominated by its latency 
(~35 milliseconds), which is significantly larger than C MPI. These results have been reproduced using a SGI Origin2000 
and a Hewlett Packard workstation cluster (connected with 100 Mbit ethernet) at Ohio State University (see Figure ||). 

The above bandwidth results were all obtained using two processors engaging in bi-directional sends and receives. Such 
a test does a good job of testing the individual links on an a multi-processor system. To more broadly test the interconnect 
the send receive benchmark is run on an eight node (16 cpu) Linux cluster connected with Gigabit ethernet (see Figure 
These results are shown for one and 16 cpus. MatlabMPI is able to maintain high bandwidth even when multiple processors 
are communicating by allowing each processor to have its own receive directory. By cross mounting all the disks in a cluster 
each node only sees the traffic directed to it, which allows communication contention to be kept to a minimum. 

4 Scaling 

To further test MatlabMPI a simple image filtering application was written. This application abstracts the key computations 
that are used in many of our DoD sensor processing applications (e.g. wide area Synthetic Aperture Radar). The application 
executed repeated 2D convolution on a large image (1024 x 128,000 ~2 GBytes). This application was run on a large 
shared memory parallel computer (the SGI Origin2000 at Boston University) and achieved speedups greater than 64 on 
64 processors; showing the classic super-linear speedup (due to better cache usage) that comes from breaking a very large 
memory problem into many smaller problems (see Figure |5|). 

To further test the scalability, the image processing application was run with a constant load per processor (1024 x 1024 
image per processor) on a large shared/distributed memory system (the IBM SP2 at the Maui High Performance Computing 
Center). In this test, the application achieved a speedup of ~300 on 304 CPUs as well achieving ~15% of the theoretical 
neak ^450 Gieaflons') of the svstem ("see Figure l6h. 



The ultimate goal of running Matlab on parallel computers is to increase programmer productivity and decrease the large 
software cost of using HPC systems. Figure ^ plots the software cost (measured in Software Lines of Code or SLOCs) as 
a function of the maximum achieved performance (measured in units of single processor peak) for the same image filtering 
application implemented using several different libraries and languages (VSIPL, MPI, OpenMP, using C++, C, and Matlab 
[^]). These data show that higher level languages require fewer hnes to implement the same level of functionality. Obtaining 
increased peak performance (i.e. exploiting more parallelism) requires more lines of code. MatlabMPI is unique in that it 
achieves a high peak performance using a small number of lines of code. 

Two useful metrics we have developed for measuring software productivity are Gigaflops/SLOC and CPUs/SLOC. Gi- 
gaflops/SLOC will increase in a given application if it can be made to run faster or can be implemented with fewer lines of 
code. CPUs/SLOC will increase if the number of processors that can be effectively used in an application can be increased 
without increasing the number of lines of code. The test application does extremely well in both of these measures, achieving 
0.85 Gigaflops/SLOC and 4.4 CPUs/SLOC, both of which are the highest values recorded for any apphcation. 

5 Conclusions and Future Work 

The use of file I/O as a parallel communication mechanism is not new and is now increasingly feasible with the availability of 
low cost high speed disks. The extreme example of this approach are the now popular Storage Area Networks (SAN), which 
combine high speed routers and disks to provide server solutions. Although using file I/O increases the latency of messages 
it normally will not effect the bandwidth. Furthermore, the use of file I/O has several additional functional advantages which 
make it easy to implement large buffer sizes, recordable messages, multi-casting, and one-sided messaging. Finally, the 
MatlabMPI approach is readily applied to any language (e.g. IDL, Python, and Perl). 

MatlabMPI demonstrates that the standard approach to writing parallel programs in C and Fortran (i.e. using MPI) is also 
valid in Matlab. In addition, by using Matlab file I/O, it was possible to implement MatlabMPI entirely within the Matlab 
environment, making it instantly portable to all computers that Matlab runs on. Most potential parallel Matlab applications 
are trivially parallel and don't require high performance. Never-the-less, MatlabMPI can match C MPI performance on large 
messages. The simplicity and performance of MatlabMPI makes it a very reasonable choice for programmers that want to 
speed up their Matlab code on a parallel computer. 

MatlabMPI provides the highest productivity parallel computing environment available. However, because it is a point- 
to-point messaging library, a significant amount code of must be added to any application in order to do basic parallel 
operations. In the test application presented here, the number of lines of Matlab code increased from 35 to 70. While a 
70 line parallel program is extremely small, it represents a significant increase over the single processor case. Our future 
work will aim at creating higher level objects (e.g. distributed matrices) that will eliminate this parallel coding overhead (see 
Figure ||). The resulting "Parallel Matlab Toolbox" will be built on top of the MatlabMPI communication layer, and will 
allow a user to achieve good parallel performance without increasing the number lines of code. 

Acknowledgments 

The authors would like to thank the sponsorship of the DoD High Performance Computing Modernization Office, and the 
staff at the supercomputing centers at Boston University, Ohio State University and the Maui High Performance Computing 
Center. 

A Parallel Image Filtering Test Application 

MPI_Init; % Initialize MPI. 

comm = MPI_COMM_WORLD; % Create communicator. 
comm_size = MPI_Comm_size (comm) ; % Get size. 
my_rank = MPI_Comm_rank (comm) ; % Get rank. 
% Do a synchronized start. 
starter_rank = 0; 
delay = 30; % Seconds 

synch_start (comm, starter_rank, delay) ; 

n imaae x = 2 . " ( 1 0+1 ) *comm size; % Set imaae size (use nowers of 2). 



n_image_y = 2 . " 1 ; 

n_point = 100; % Number of points to put in each sub-image. 
% Set filter size (use powers of 2) . 
n_filter_x = 2. "5; 
n_filter_y = 2. '5; 

n_trial =2; % Set the number of times to filter. 
% Computer number of operations. 

total_ops = 2 . *n_trial*n_f ilter_x*n_f ilter_y*n_image_x*n_image_y; 
if (rem (n_image_x, comm_size) ~= 0) 

dispCERROR: processors need to evenly divide image'); 

exit ; 
end 

disp ( [ ' my_rank; : ' , num2str (my_rank;) ] ) ; % Print rank. 

left = my_rank - 1; % Set who is source and who is destination. 

if (left < 0) 

left = comm_size - 1; 
end 

right = my_rank + 1; 
if (right >= comm_size) 

right = 0; 
end 

tag = 1; % Create a unique tag id for this message. 
start_time = zeros (n_trial) ; % Create timing matrices. 
end_time = start_time; 

zero_clock = clock; % Get a zero clock. 

n_sub_image_x = n_image_x . /comm_size; % Compute sub_images for each processor. 
n_sub_image_y = n_image_y; 

% Create starting image and working images.. 
sub_imageO = rand (n_sub_image_x, n_sub_image_y) . '10; 

sub_image = sub_imageO; 

work_image = zeros (n_sub_image_x+n_f ilter_x, n_sub_image_y+n_filter_y) ; 
% Create kernel. 

x_shape = sin (pi . * (0 : (n_f ilter_x-l) ) . / (n_filter_x-l) ) . '2; 
y_shape = sin (pi . * (0 : {n_f ilter_y-l) ) . / (n_filter_y-l) ) . '2; 
kernel = x_shape . ' * y_shape; 
% Create box indices . 

Iboxw = [ 1 , n_f ilter_x/2 , 1 , n_sub_image_y ] ; 

cboxw = [n_f ilter_x/2+l, n_f ilter_x/2+n_sub_image_x, 1, n_sub_image_y] ; 

rboxw = [n_f ilter_x/2+n_sub_image_x+l, n_sub_image_x+n_f ilter_x, 1, n_sub_image_y] ; 

Iboxi = [ 1 , n_f ilter_x/2 , 1 , n_sub_image_y ] ; 

rboxi = [n_sub_image_x-n_f ilter_x/2+l , n_sub_image_x, 1 , n_sub_image_y ] ; 
start_time = etime (clock, zero_clock) ; % Set start time. 
% Loop over each trial, 
for i_trial = l:n_trial 

% Copy center sub_image into work_image. 

work_image (cboxw (1) : cboxw (2) , cboxw (3) : cboxw (4) ) = sub_image; 
if (comm_size > 1) 

Itag = 2.*i_trial; % Create message tag. 

rtag = 2 . *i_trial+l ; 

% Send left sub-image. 

l_sub_image = sub_image ( Iboxi { 1 ): Iboxi (2 ), Iboxi (3) : Iboxi (4 )) ; 
MPI_Send( left, Itag, comm, l_sub_image ); 
% Receive right padding. 



r_pad = MPI_Recv{ right, Itag, comm ) ; 

work_image (rboxw { 1 ) : rboxw (2 ) , rboxw { 3 ) : rboxw { 4 ) ) = r_pad; 
% Send right sub-image. 

r_sub_image = sub_image { rboxi { 1 ) : rboxi ( 2 ) , rboxi { 3 ) : rboxi { 4 ) ) ; 

MPI_Send{ right, rtag, comm, r_sub_image ) ; 

% Receive left padding. 

l_pad = MPI_Recv{ left, rtag, comm ) ; 

work;_image (Iboxw { 1 ) : Iboxw (2 ) , Iboxw (3 ) : Iboxw ( 4 ) ) = l_pad; 

end 

work_image = conv2 {work_image, kernel, ' same' ) ; % Compute convolution. 
% Extract sub_image. 

sub_image = work_image (cboxw (1) : cboxw (2) , cboxw (3) : cboxw (4) ) ; 
end 

end_time = etime (clock, zero_clock) ; % Get end time for the this message. 

total_time = end_time - start_time % Print the results. 

% Print compute performance. 

gigaflops = total_ops / total_time / l.e9; 

disp { [ ' GigaFlops : ' , num2str (gigaflops) ] ) ; 

MPI_Finalize; % Finalize Matlab MPI. 

exit ; 
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MPI_Send (variable, dest, tag, comm); 

Sender 



i 



Shared File System 



variable 



savei 



create i 



Data file 



Lock file 



iload 



Receiver 



variable 



i detect 



variable = iMPi Recv (source, tag, comm); 



Figure 1: File Based Communication. Sender saves a variable in a Data file, then creates Lock file. Receiver detects Lock 
file, then loads Data file. 
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Figure 2: MatlabMPI vs. MPI Bandwidth. Communication performance as a function message size on the SGI Ori- 
gin2000. MatlabMPI equals C MPI performance at large message sizes. 
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Figure 3: Bandwidth Comparison. Bandwidth measured on Ohio St. SGI Origin2000 and a Hewlett Packard workstation 
cluster. 




Figure 4: Bandwidth on a Linux Cluster. Communication performance on Linux cluster with Gigabit ethemet. MatlabMPI 
is able to maintain communication performance even when larger numbers of processor are communicating. 
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Figure 5 : Shared Memory Parallel Speedup. Speed increase on the SGI Origin2000 of a parallel image filtering applica- 
tion as a function of the number of processors. Application shows "classic" super-linear performance (due to better cache 
usage) that results when a very large memory problem is broken into multiple small memory problems. 
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Figure 6: Shared/Distributed Parallel Speedup. Measure performance on the IBM SP2 of a parallel image filtering 
application. Application achieves a speedup of ~300 on 304 processors and ~15% of the theoretical peak (450 Gigaflops). 
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Figure 7: Productivity vs Performance. Lines of code as a function maximum achieved performance (measured in units 
of single processor theoretical peak) for different implementations of the same image filtering application. Higher level 
languages allow the same application to be implemented with fewer lines of code. Increasing the maximum performance 
generally increases the lines of code. MatlabMPI allows high level languages to run on multiple processors. Parallel Matlab 
is an estimate of what will be possible using the Parallel Matlab Toolbox. Abbreviations: VSIPL (Vector, Signal, and Image 
Processing Library) see www.vsipl.org; PVL (Parallel Vector Library) see www.hpec-si.org. 
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Figure 8: Future Layered Architecture. Design of the Parallel Matlab Toolbox which will create distributed data structures 
and dataflow objects built on top of MatlabMPI (or any other messaging system). 



